Matrix Bidiagonalization on the Trident Processor
Authors
Abstract
This paper discusses the implementation and evaluation of the reduction of a dense matrix to bidiagonal form on the Trident processor. The standard Golub and Kahan Householder bidiagonalization algorithm, which is rich in matrix-vector operations, and the LAPACK subroutine _GEBRD, which uses a mixture of vector, matrix-vector, and matrix operations, are simulated on the Trident processor. We show how to use the Trident parallel execution units, ring, and communication registers to effectively perform the vector, matrix-vector, and matrix operations needed for bidiagonalizing a matrix. The number of clock cycles per FLOP is used as a metric to evaluate the performance of the Trident processor. Our results show that increasing the number of Trident lanes proportionally decreases the number of cycles needed per FLOP. On a 32K × 32K matrix with 128 Trident lanes, using matrix-vector operations in the standard Golub and Kahan algorithm gives a speedup of around 1.5 times over using vector operations. Using matrix operations in the _GEBRD subroutine, however, gives a speedup of around 3 times over vector operations, and around 2 times over using matrix-vector operations in the standard Golub and Kahan algorithm.
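As a reference point for the algorithm the abstract describes, the following is a minimal NumPy sketch of Golub and Kahan Householder bidiagonalization: left Householder reflections zero each column below the diagonal, and right reflections zero each row beyond the superdiagonal. This dense two-sided update is illustrative only; it is not the Trident or LAPACK _GEBRD implementation, and the function names are our own.

```python
import numpy as np

def householder(x):
    """Return a unit Householder vector v so that
    (I - 2 v v^T) x is a multiple of e1."""
    v = x.astype(float).copy()
    v[0] += (1.0 if x[0] >= 0 else -1.0) * np.linalg.norm(x)
    norm = np.linalg.norm(v)
    if norm > 0:
        v /= norm
    return v

def bidiagonalize(A):
    """Golub-Kahan Householder bidiagonalization (sketch).
    For an m x n matrix with m >= n, returns B = U^T A V,
    which is upper bidiagonal and shares A's singular values."""
    B = A.astype(float).copy()
    m, n = B.shape
    for j in range(n):
        # Left reflection: zero out B[j+1:, j] (column below diagonal).
        v = householder(B[j:, j])
        B[j:, j:] -= 2.0 * np.outer(v, v @ B[j:, j:])
        if j < n - 2:
            # Right reflection: zero out B[j, j+2:] (row past superdiagonal).
            w = householder(B[j, j + 1:])
            B[j:, j + 1:] -= 2.0 * np.outer(B[j:, j + 1:] @ w, w)
    return B
```

The left update `v @ B[j:, j:]` and the right update `B[j:, j+1:] @ w` are the matrix-vector products the abstract refers to; they dominate the flop count, which is why mapping them efficiently onto the Trident lanes matters.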
Similar resources
BLAS on the Trident Processor: Implementation and Performance Evaluation
This paper describes the implementation of the Basic Linear Algebra Subprograms (BLAS), which are widely used in many applications, on the Trident processor. We show how to use the Trident parallel execution units, ring, and communication registers to effectively perform vector-vector, matrix-vector, and matrix-matrix operations needed for implementing BLAS. The TFLOPS rate on infinite-size pro...
Full text
Trident: A Scalable Architecture for Scalar, Vector, and Matrix Operations
Within a few years it will be possible to integrate a billion transistors on a single chip. At this integration level, we propose using a high level ISA to express parallelism to hardware instead of using a huge transistor budget to dynamically extract it. Since the fundamental data structures for a wide variety of applications are scalars, vectors, and matrices, our proposed Trident processor exte...
Full text
Bidiagonalization with Parallel Tiled Algorithms
We consider algorithms for going from a “full” matrix to a condensed “band bidiagonal” form using orthogonal transformations. We use the framework of “algorithms by tiles”. Within this framework, we study: (i) the tiled bidiagonalization algorithm BiDiag, which is a tiled version of the standard scalar bidiagonalization algorithm; and (ii) the R-bidiagonalization algorithm R-BiDiag, which is a ...
Full text
Large-scale Inversion of Magnetic Data Using Golub-Kahan Bidiagonalization with Truncated Generalized Cross Validation for Regularization Parameter Estimation
In this paper a fast method for large-scale sparse inversion of magnetic data is considered. The L1-norm stabilizer is used to generate models with sharp and distinct interfaces. To deal with the non-linearity introduced by the L1-norm, a model-space iteratively reweighted least squares algorithm is used. The original model matrix is factorized using the Golub-Kahan bidiagonalization that proje...
Full text
Divide and Conquer Low-rank Preconditioning Techniques
This paper presents a preconditioning method based on a recursive multilevel low-rank approximation approach. The basic idea is to recursively divide the problem into two and apply a low-rank approximation to a matrix obtained from the Sherman-Morrison formula. The low-rank approximation may be computed by the partial Singular Value Decomposition (SVD) or it can be approximated by the Lanczos bi...
Full text